Abstract: Clustering is a widely studied data mining problem in the text domain. It finds numerous applications in classification, visualization, document organization, collaborative filtering, and indexing. A large quantity of the information in documents is present in the form of text, but the data is not available purely as text. It also contains a lot of side information, such as the different kinds of links in a document, user-access behaviour, document provenance information from web logs, or other non-textual attributes. These attributes may contain a large amount of information that is useful for clustering. However, it is difficult to estimate the relative importance of this information, especially when some of it is noisy. In such situations, it can be risky to integrate the side information into the mining process, because it can either improve the quality of the representation for the mining process or add noise to it. A principled way is therefore needed to perform the mining process so as to maximize the advantages of using the available side information. In this paper, we propose the use of the K-means algorithm for better and more efficient clustering of this information.
Keywords: Clustering, Information Retrieval, K-means, Side Information, Text Mining.